User segmentation based on behavior - mobile app analysis

Data overview

Source

There are 3 sources.

There are no duplicates.

Dataset

There are no duplicates.

Conclusion:

Source:

Dataset:

Data preprocessing

Conclusion:

Exploratory data analysis

Source

All users in the source table are unique.

Now we can evaluate the sources in terms of user acquisition.

Among the top sources of traffic, Yandex ranks first, followed by Other (other sources), and Google in third place.

It appears that more money is spent on buying traffic from Yandex than from Google. It is important to check whether this traffic pays off: if traffic from Google performs no worse, a redistribution of budgets can be considered.

Dataset

The number of unique users is the same as in the source table.

Data is available from October 7 to November 3, 2019, a period of almost 28 days.

Checking for anomalies in user activity

Let's see if there are very active users, as their actions may distort further analysis.

There are 4293 events in total, an average of 17.3 events per user, a median of 9, and a maximum of 478 events per user.

There are anomalously active users present. Let's look at the percentiles to find where the anomaly boundary lies.

As can be seen from the graph, users performing more than 100 actions are the exception. Calculation shows that only 1% of users perform more than 132 actions in the application. Let's remove them from further analysis.
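The percentile check can be sketched as follows, assuming a pandas dataframe `events` with a `user_id` column (the names are hypothetical):

```python
import pandas as pd

def activity_percentiles(events: pd.DataFrame,
                         percentiles=(0.5, 0.9, 0.95, 0.99)) -> pd.Series:
    """Events per user at the given percentiles, to locate the anomaly boundary."""
    per_user = events.groupby("user_id").size()
    return per_user.quantile(list(percentiles))
```

Users above the chosen percentile (here, the 99th, i.e. more than 132 actions) are then dropped from the analysis.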

In addition, users who have completed one action will not be useful for our study.

There aren't many of these users. Let's delete them.

A total of 109 users have been removed for performing either a single event or more than 132 events within the application.

Checking the distribution of events by day

The next step is to calculate the distribution of events by day.

Since we see very few tips_click events, the vast majority of tips_show events (the user saw recommended ads) are automatic and do not indicate interaction with the mobile application. Let's try to visualize this without tips_show.

Building the sales funnel

Tips_show is the most frequent event.

Calculating how many unique users completed each of these events. Sorting events by the number of unique users. Calculating the proportion of unique users who completed at least one event.
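These steps can be sketched as follows, assuming `events` has `user_id` and `event_name` columns (hypothetical names):

```python
import pandas as pd

def funnel_counts(events: pd.DataFrame) -> pd.DataFrame:
    """Unique users per event, sorted descending, plus each event's share
    of all users who completed at least one event."""
    total_users = events["user_id"].nunique()
    counts = (events.groupby("event_name")["user_id"].nunique()
                    .sort_values(ascending=False)
                    .rename("users")
                    .to_frame())
    counts["share"] = counts["users"] / total_users
    return counts
```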

Tips_show was completed by the most users.

Visualizing the sales funnel based on site search to determine what percentage of unique users proceed to the next step.

Let's form a dataframe from which tips_show events are removed, leaving only the search, advert_open, photos_show, and contacts_show events.

When building the sales funnel with the search, advert_open, photos_show, and contacts_show events, we found that:

Searching takes a long time, and only 8% of users open ads. Users probably choose between opening the ad or the photos right away: 74% of users open photos immediately plus the 8% who open ads gives 82%. 27% of the initial users view contacts.

Visualizing the sales funnel based on tips_show to determine what percentage of unique users proceed to the next step.

When building the sales funnel with the tips_show, tips_click, favorites_add, and contacts_show events, we found that:

The distorted shape of the funnel tells us that there are problems in the processes. Many users enter the funnel, but only a few make it to the next stage. This indicates a marketing error - showing the wrong ads to the wrong audience. There is a need to improve the ad serving system, since only 18% of contacts are viewed.

Identifying how users use the application

If the user's first and last visit fall on the same day, then the difference will be 0. Let's add one to find out how many days the application was used.
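A sketch of this calculation, assuming `events` has `user_id` and a datetime `event_time` column (hypothetical names):

```python
import pandas as pd

def usage_days(events: pd.DataFrame) -> pd.Series:
    """Days between each user's first and last event, inclusive.
    Same-day first and last visits give a difference of 0, so we add 1."""
    ts = events.groupby("user_id")["event_time"].agg(["min", "max"])
    return (ts["max"].dt.normalize() - ts["min"].dt.normalize()).dt.days + 1
```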

During the selected period, most users used the application for only one day.

Let's see how long users used the app and what events were the first to occur.

Let's create a sessions_start dataframe containing the first events of sessions: events that occurred more than 30 minutes after the user's previous event, or events that were the user's first. Let's also add a column containing the ID of the session's first event.
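The sessionization can be sketched as follows, assuming a dataframe `events` with `user_id` and datetime `event_time` columns (hypothetical names):

```python
import pandas as pd

SESSION_TIMEOUT = pd.Timedelta(minutes=30)

def add_session_ids(events: pd.DataFrame) -> pd.DataFrame:
    """Flag session starts (a user's first event, or a gap of more than
    30 minutes since their previous event) and number sessions per user."""
    events = events.sort_values(["user_id", "event_time"]).copy()
    gap = events.groupby("user_id")["event_time"].diff()
    events["session_start"] = (gap.isna() | (gap > SESSION_TIMEOUT)).astype(int)
    # a running count of starts within each user numbers that user's sessions
    events["session_id"] = events.groupby("user_id")["session_start"].cumsum()
    return events
```

Note that `session_id` here is unique only within a user; a globally unique key would combine it with `user_id`.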

Each user has on average 2 sessions. There are almost 7 events per session, ranging from 1 when users exit the application to 104 when they actively search.

Since users can return from the background, any event can become the start of a session.

Calculating the average time from the session's first event to the target (contacts_show)

Let's create a table of users that contains the date of the first event and the date of the first target event.
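A sketch of that table, assuming `events` has `user_id`, `event_name`, and `event_time` columns (hypothetical names):

```python
import pandas as pd

def time_to_target(events: pd.DataFrame, target: str = "contacts_show") -> pd.DataFrame:
    """Per user: time of the first event, time of the first target event,
    and the delta between them. Users without the target event are dropped."""
    first = events.groupby("user_id")["event_time"].min().rename("first_event")
    first_target = (events[events["event_name"] == target]
                    .groupby("user_id")["event_time"].min()
                    .rename("first_target"))
    out = pd.concat([first, first_target], axis=1).dropna()
    out["time_delta"] = out["first_target"] - out["first_event"]
    return out
```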

"Contacts_show" appeared in 1478 sessions, and there were 9428 sessions in total.

For some sessions, "time_delta" == 0 because the session started with the "contacts_show" event itself.

506 sessions started with contacts_show.

Even though such sessions represent 34.24% of all sessions containing contacts_show, we will discard them to avoid distorting the results.

We have 972 sessions left.

We calculated the mean and median time between the first and target event for 724 users.

Check out the histogram for the distribution of time in minutes.

The maximum time between the first and target events is 01:38:05, the average is 00:09:21, and the median is 00:04:49. Based on the histogram, most values fall within the first ~20 minutes.

Comparing users by the frequency of events

Let's compare the behavior of users who performed the contacts_show target event and those who did not.

There are no such users.

Who completed the target event:

Analyzing events based on their conversion to the target event - contacts_show

Let's calculate conversions to the target event in the context of sessions using the time_diff table we created earlier.
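This session-level conversion can be sketched as follows, assuming each event row already carries a `session_id` (as built during sessionization above; names are hypothetical):

```python
import pandas as pd

def conversion_by_event(events: pd.DataFrame, target: str = "contacts_show") -> pd.Series:
    """For each non-target event: the share (in %) of its occurrences that
    fall in sessions which also reach the target event."""
    has_target = (events.assign(is_target=events["event_name"] == target)
                        .groupby("session_id")["is_target"].any())
    events = events.copy()
    events["converted"] = events["session_id"].map(has_target)
    conv = (events[events["event_name"] != target]
            .groupby("event_name")["converted"].mean() * 100)
    return conv.sort_values(ascending=False).round(2)
```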

A high conversion rate is found for 'search' (12.32), perhaps because users entered the application already looking for specific items. It is followed by photos_show; tips_click has the lowest conversion rate.

Comparing users based on retention rates

Identify the groups of users who return frequently to the mobile application (retention rate).

Let's look at the heatmap.
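The retention table behind such a heatmap can be computed roughly as follows (weekly cohorts are an assumption here; the project may well cohort by day):

```python
import pandas as pd

def retention_table(events: pd.DataFrame) -> pd.DataFrame:
    """Cohort (week of first event) x weeks-since-first-event table of
    unique users, normalized by each cohort's week-0 size."""
    first = events.groupby("user_id")["event_time"].transform("min")
    tmp = events.assign(
        cohort=first.dt.to_period("W"),
        week=(events["event_time"] - first).dt.days // 7,
    )
    pivot = (tmp.groupby(["cohort", "week"])["user_id"].nunique()
                .unstack(fill_value=0))
    return pivot.div(pivot[0], axis=0)
```

The resulting dataframe is what gets passed to a heatmap plot; light cells mark cohorts with high retention.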

Users who were attracted on October 14 showed the highest retention rates, as indicated by the light spots on the chart.

Conclusion:

User segmentation

To perform clustering, we first need to generate the clustering features.

Standardizing data
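A minimal standardization sketch in plain pandas, using the population standard deviation (this matches what a typical scaler such as sklearn's StandardScaler does):

```python
import pandas as pd

def standardize(features: pd.DataFrame) -> pd.DataFrame:
    """Scale each feature to zero mean and unit variance so that no single
    feature dominates the K-Means distance metric."""
    return (features - features.mean()) / features.std(ddof=0)
```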

Training a clustering model based on the K-Means algorithm and predicting customer clusters

There are 64 users in the smallest group and 3294 users in the largest group.

Calculating silhouette metrics for clustering

A silhouette score can range from -1 to 1; the closer to 1, the better the clustering. With a silhouette score of 0.37, the clustering went reasonably well.
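Fitting K-Means and scoring it with the silhouette coefficient can be sketched as follows, assuming scikit-learn and a standardized feature matrix `X` (the cluster count and seed are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_users(X: np.ndarray, n_clusters: int = 4, seed: int = 42):
    """Fit K-Means on standardized features and return the cluster labels
    together with the silhouette score (-1 .. 1, higher = better-separated)."""
    model = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    labels = model.fit_predict(X)
    return labels, silhouette_score(X, labels)
```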

Identifying feature mean values for clusters

Let's look at the key metrics.

The clusters have the following characteristics:

Group 0

Despite being the second largest group (590 users), this is not the most active one. Users in this group perform a small number of events (18) and sessions (4) and are active for 12 days. The vast majority of them see ads in Yandex recommendations but seldom open them and do not view photos or contacts. The situation is better than in group 1, but much worse than in groups 2 and 3. These users may be saving advertisements to favorites and delaying purchases.

Group 1

This group has the most participants (3294) and is the most passive: 10 events in total, 1 session, and 1 day of activity. Most users come from the recommendation systems of all three traffic sources and rarely search for anything themselves. They open ads occasionally, look at photos a little, and rarely perform the target event - viewing contacts - themselves.

Group 2

The smallest group (64 users). These users take part in 65 events and 12 sessions over a 12-day period. Many of them search for ads on their own, look at photos, and view contacts. Google and Yandex are the leading traffic sources. These users clearly understand what product or service they are looking for and therefore most often use search.

Group 3

It is the third largest (236 users) and the most active group. Users here perform the most events (67) and quite a few sessions (5), and have been active within the past week. Most users open ads from Yandex recommendations but rarely view photos. They do perform the target event, viewing contacts: this group leads in the number of contacts_show events per user.

Let us construct feature distributions for clusters.

The following features of clusters can be identified from the graph:

The first and second clusters have good event_count, session_count, and lifetime indicators, but poor event_advert_open, event_photos_show, and event_contacts_show.

Analyzing the time difference between user events from different clusters

In the time_diff table we calculated the time from the first to the target event in the time_delta column. Let's combine this table with mobile_clusters to see how the times differ between the events of users from different clusters.

It takes 9:02 minutes for group 2 to reach the target event, while group 3 takes 12:38 minutes. Group 3, one of the leading groups, spends the most time because its users perform more actions.

The times for the most inactive group 1 and the slightly more active group 0 are practically the same: group 1 spends 9:34 minutes and group 0 spends 8:54 minutes.

Identifying clusters that convert to target events more frequently (conversion to a target event)

The highest conversion rate was in group 2 (62%).

Identifying clusters that return to the mobile app frequently (Retention rate)

Retention rate for cluster 0

These users aren't inclined to return, likely because they mostly see Yandex recommendations with ads and are unlikely to search for something on their own.

Retention rate for cluster 1

After four weeks, Group 1 (the most passive group) did not return.

Retention rate for cluster 2

Group 2 has the highest retention rate, perhaps because users like to search for things on their own.

Retention rate for cluster 3

It is interesting that users in group 3 are less inclined to return, even though it is one of the leading groups. This is perhaps because they mostly view ads through search engine recommendations and are not interested in searching independently.

Conclusion:

Motivation recommendations for users in groups 0 and 1:

The ad serving system needs to be improved. Users see ads in all search engines but are not interested in looking at photos, let alone contacts. Users are not shown what is relevant to them. It is necessary to find out what interests them and build the algorithms on that information.

Testing statistical hypotheses

  1. Some users installed the application via a link from Yandex, others via a link from Google. Test the hypothesis: conversions to contact views in the two groups have statistically significant differences.

  2. Test the hypothesis: conversions to contact views from site search and from the recommender system have statistically significant differences.

Hypothesis 1: Conversions to contact views of users who came from Yandex and of users who came from Google are different.

Let's formulate the null and alternative hypotheses:

At the given critical level of statistical significance, there are no statistically significant differences between the conversions of users who installed the application via a link from Yandex and those who installed it via a link from Google. The source has no effect on views of the target event.

Hypothesis 2: Site search (search) to contact views and recommender (tips_show) to contact views conversions are different.

What we have:

Let's formulate the null and alternative hypotheses:

At a given critical level of statistical significance, there are statistically significant differences between conversion rates of site search (search) to contact views and recommendation system (tips_show) to contact views.
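Both hypotheses can be tested with a two-proportion z-test. A minimal sketch using only the standard library; in a real run the success counts and group sizes would come from the conversion tables above:

```python
from math import erf, sqrt

def two_proportion_ztest(success1: int, n1: int, success2: int, n2: int):
    """Two-sided z-test for H0: the two conversion rates are equal."""
    p1, p2 = success1 / n1, success2 / n2
    p_pooled = (success1 + success2) / (n1 + n2)  # pooled rate under H0
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # standard normal CDF expressed via erf; two-sided p-value
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

With the usual critical level alpha = 0.05, H0 is rejected when the p-value falls below 0.05.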

Conclusion:

Final conclusion and recommendations

Exploratory data analysis:

User segmentation:

Hypotheses testing:

Recommendations:

The ad serving system needs to be improved. Users see ads in all search engines but are not interested in looking at photos, let alone contacts. Users are not shown what is relevant to them. It is necessary to find out what interests them and build the algorithms on that information.